Thesis – Related Work

6.2 Related Work

Existing sign language corpora and datasets usually consist of videos of utterances, whether isolated signs or phrases. Many are not intended for research but for everyday use, and are structured as dictionaries.

Spread the Sign¹ (Hilzensauer and Krammer 2015) and DILSE² (Moreno 2012) are two such dictionaries, containing sign language translations (as video) of words in one or more oral languages. These videos are provided without any phonetic annotation. DILSE is interesting, however, in that in addition to the video it includes static photographs of signers in which movement is annotated with superimposed arrows and, when needed, different instants of the sign are recorded as consecutive photographs. We see this as an approach halfway to SignWriting, which follows similar principles but in a more abstract and standardized manner.

Other datasets, though often also intended to be usable as dictionaries, are structured to support research, allowing users to examine the data or search by features rather than only by meaning. LSE-Sign (Gutierrez-Sigut et al. 2016) is a web tool that contains 2400 signs from Spanish Sign Language, annotated with linguistic features so that signs with specific characteristics can be searched for. The signs are stored as videos and glosses, but the annotation is rich, with entries for hand shape and orientation, movement shape, and other features. Similar corpora of sign language videos exist for Australian Sign Language and British Sign Language, among others, based on the Signbank software (Cassidy et al. 2018).
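To make the idea of searching by features rather than by meaning concrete, the following is a minimal sketch in Python. The `SignEntry` record, its field names, the feature values, and the three toy entries are all invented for illustration; they do not reproduce the schema or the data of LSE-Sign or of any Signbank corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignEntry:
    """One annotated dictionary entry (hypothetical schema)."""
    gloss: str        # written-language label for the sign
    handshape: str    # illustrative feature values, not a real inventory
    orientation: str
    movement: str


def search(entries, **features):
    """Return the entries whose annotated features all match the query."""
    return [
        entry for entry in entries
        if all(getattr(entry, name) == value for name, value in features.items())
    ]


# Toy lexicon: invented annotations, used only to exercise the search.
LEXICON = [
    SignEntry("HOUSE", handshape="flat", orientation="palm-down", movement="straight"),
    SignEntry("BOOK", handshape="flat", orientation="palm-up", movement="arc"),
    SignEntry("DRINK", handshape="curved", orientation="palm-side", movement="arc"),
]

# Query by form instead of by meaning.
flat_signs = [e.gloss for e in search(LEXICON, handshape="flat")]
flat_arc_signs = [e.gloss for e in search(LEXICON, handshape="flat", movement="arc")]
```

A query of this kind is not possible over a plain video dictionary, where the only index is the translated word.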

A common requirement of sign language corpora is a relevant and meaningful annotation of the signs depicted, since video by itself is not computationally processable. To this end, phonological or phonetic transcriptions of signs can be used, but there is no universally accepted way to represent the movements and gestures of sign language, either formally or computationally. The forefather of sign language linguistics, William Stokoe, proposed a linear writing system consisting of abstract symbols that encode the different parameters of the language (Stokoe 1960). The Hamburg Notation System (Hanke 2004) uses a similar approach but with a different set of symbols, while Ángel Herrero Blanco, in his study of Spanish Sign Language, developed another featural writing system, this one using characters from the Roman alphabet (Herrero Blanco 2003).

A different approach to sign language transcription can be found in SignWriting (Sutton and Frost 2008), a featural writing system (Galea 2014, 76-77) in which abstract symbols represent linguistic features. Their shapes are chosen to be as iconic as possible, helping the reader and writer remember the physical articulator each symbol represents. The main difference that SignWriting introduces is that symbols are also arranged iconically, rather than linearly. The two-dimensional page is used to represent three-dimensional sign space, and symbols are placed on it according to their actual location in the realization of the sign. This enables the writer to capture the spatial richness of sign language almost directly, but it is a radical departure from the main paradigm of oral writing systems.

One problem with this arrangement is that the usual computational representation of oral writing systems, as sequences of individual and mostly independent symbols, is insufficient for representing SignWriting. Nonetheless, there are ongoing efforts to solve this problem, and part of SignWriting can already be represented using Unicode.

Unicode is a “universal character encoding standard for written characters and text” (The Unicode Consortium 2021) which assigns a number, or code point, to each character in use in a documented human language, so that text can be stored computationally as a sequence of bytes. It includes code points and combinations for many of the symbols in SignWriting, and the International SignWriting Alphabet (Sutton and Slevinski 2010) provides fonts which, when installed on the user’s computer, allow for the proper display of the symbols.
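As a small illustration of what this encoding amounts to, the sketch below inspects one character from the Sutton SignWriting block of Unicode (U+1D800 to U+1DAAF) and shows its code point and its UTF-8 bytes. The helper name `describe` and the shape of its result are our own choices for this example; displaying the glyph itself still requires a suitable font.

```python
# Sutton SignWriting block: U+1D800..U+1DAAF (range end is exclusive).
SUTTON_SIGNWRITING = range(0x1D800, 0x1DAB0)


def describe(char):
    """Report how a single character is numbered and stored."""
    code_point = ord(char)
    utf8_bytes = char.encode("utf-8")
    return {
        "code_point": f"U+{code_point:04X}",
        "utf8": " ".join(f"{byte:02x}" for byte in utf8_bytes),
        "in_signwriting_block": code_point in SUTTON_SIGNWRITING,
    }


# First character of the block, written as an escape so that no font is needed.
symbol = "\U0001D800"
info = describe(symbol)
# info["code_point"] is "U+1D800" and info["utf8"] is "f0 9d a0 80"
```

The character is stored like any other, as a number serialized to bytes, and nothing in that representation says where on the page the symbol belongs.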

However, “the spatial arrangement of the symbols (…) constitutes a higher-level protocol beyond the scope of the Unicode Standard” (The Unicode Consortium 2021, 831), meaning that Unicode alone is not enough to fully encode SignWriting. To solve this problem, computational solutions such as Formal SignWriting or SignWriting Markup Language (Rocha Costa and Dimuro 2002; Verdu Perez et al. 2017; Slevinski 2016) often store positional information as numerical coordinates alongside the Unicode characters, indicating where to place each symbol in the two-dimensional space of the sign. Nonetheless, these systems are intended mainly for the creation and display of SignWriting, not for its linguistic processing, and thus lack many of the annotations needed for fully automatic understanding of the transcriptions.
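The following sketch shows, under stated assumptions, what storing coordinates alongside symbols can look like. The string layout is loosely modelled on Formal SignWriting (a symbol key introduced by `S`, followed by an `x`-separated coordinate pair), but the exact syntax accepted here, the example string, and the names `PlacedSymbol` and `parse_sign` are assumptions made for illustration, not a reference implementation of any of the cited systems.

```python
import re
from typing import NamedTuple


class PlacedSymbol(NamedTuple):
    key: str  # symbol identifier (five hexadecimal digits in this sketch)
    x: int    # horizontal position in the sign's two-dimensional space
    y: int    # vertical position


# Assumed layout: "S" + 5 hex digits + 3-digit x + "x" + 3-digit y.
SYMBOL = re.compile(r"S([0-9a-f]{5})(\d{3})x(\d{3})")


def parse_sign(text):
    """Extract every symbol and its coordinates from an encoded sign."""
    return [
        PlacedSymbol(match.group(1), int(match.group(2)), int(match.group(3)))
        for match in SYMBOL.finditer(text)
    ]


# Illustrative string: a leading box marker followed by two placed symbols.
sign = "M518x529S14c20481x471S27106503x489"
symbols = parse_sign(sign)

# The coordinates are enough to lay the symbols out, for example top to bottom.
top_to_bottom = sorted(symbols, key=lambda s: (s.y, s.x))
```

Parsing recovers which symbols appear and where, which suffices for rendering. It does not say, for instance, which hand a movement arrow belongs to, and that kind of relation is the information the text above describes as missing for linguistic processing.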

Compared to video-based corpora of sign language, few sign language databases use SignWriting as their form of representation. SignPuddle³ is a dictionary and database of sign language which stores SignWriting using Unicode, with symbol positions stored as coordinate pairs. It is a multilingual dictionary, containing entries for many different sign languages across the world, including Spanish Sign Language. The web interface allows searching by word, by symbol, or by full sign, and data can be exported for offline processing. However, since it uses the computational systems mentioned above, it does not contain the “higher-level protocol” information identified by the Unicode Standard and needed for decoding the complete meaning of transcriptions.